Links in a web page may refer to web pages that are stored in the same or different host
computers. 

A web crawler is a program that automatically finds and downloads documents from 
host computers in an Intranet or the world wide web. A computer with a web crawler 
installed on it may also be referred to as a web crawler. When a web crawler is given a set of 
starting URL's, the web crawler downloads the corresponding documents. The web crawler 
then extracts any URL's contained in those downloaded documents. Before the web crawler 
downloads the documents associated with the newly discovered URL's, the web crawler 
needs to find out whether these documents have already been downloaded. If the documents 
associated with the newly discovered URL's have not been downloaded, the web crawler 
downloads the documents and extracts any URL's contained in them. This process repeats 
indefinitely or until a predetermined stop condition occurs. 
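
The download-extract-check loop described above can be sketched as follows. This is a minimal illustration only: the names `crawl`, `fetch`, `extract_urls`, `max_documents`, and the toy in-memory web `TOY_WEB` are hypothetical and do not appear in the specification, and the in-memory set `seen` merely stands in for the directory of document addresses.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_urls, max_documents=None):
    """Breadth-first crawl sketch; `fetch` and `extract_urls` are supplied by the caller."""
    seen = set(seed_urls)                        # URLs already downloaded or scheduled
    frontier = deque(dict.fromkeys(seed_urls))   # URLs scheduled for download, duplicates dropped
    downloaded = []
    while frontier:
        if max_documents is not None and len(downloaded) >= max_documents:
            break                                # predetermined stop condition
        url = frontier.popleft()
        document = fetch(url)                    # download the document
        downloaded.append(url)
        for link in extract_urls(document):      # extract URLs from the document
            if link not in seen:                 # skip documents already downloaded or scheduled
                seen.add(link)
                frontier.append(link)
    return downloaded

# Hypothetical toy "web": each URL maps directly to the list of links in its document.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}

order = crawl(["a"], fetch=lambda url: TOY_WEB[url], extract_urls=lambda doc: doc)
```

In this sketch each URL is downloaded once even though "a" and "c" are linked from more than one document, which is the property the directory check is meant to guarantee.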

Typically, to find out whether the documents associated with a set of discovered 
URL's have already been downloaded or are scheduled to be downloaded, the web crawler 
checks a directory of document addresses. These document addresses are URL's that 
correspond to documents which have either already been downloaded or are scheduled to be 
downloaded; for convenience, these documents will be referred to as downloaded 
documents. The directory stores the URL's of the downloaded documents, or representations 
of the URL's. The set of URL's in downloaded documents could potentially contain 
addresses of every document on the world wide web. As of 1999 there were approximately 
800 million web pages on the world wide web and the number is continuously growing. 
Even Intranets can store millions of web pages. Thus, web crawlers need efficient data 
structures to keep track of downloaded documents and any discovered addresses of 
documents to be downloaded. Such data structures are needed to facilitate fast data checking 
and to avoid downloading a document multiple times. 

Typically, the set of downloaded document addresses is stored in disk storage, which
has relatively slow access time. One example of a method designed to facilitate fast data
checking and to avoid downloading a document multiple times is disclosed in U.S. Patent
Application Serial No. 09/433,008, filed November 2, 1999. That document discloses storing
address representations on disk, and using an efficient address representation to facilitate fast
[...] consuming operation. Therefore, buffer B 107 is typically made fairly large so as to minimize
the frequency of such merge operations.

During the merge process, which is an ordered merge, fingerprint N_k must be inserted
in the fingerprint disk file 113 in the proper location, as illustrated in Figure 6, so that the
disk file 113 remains ordered. This requires the disk file to be completely re-written. To
avoid this lengthy rewrite process, in a preferred embodiment, the fingerprint disk file may be
sparsely-filled, using open addressing. For this embodiment, the fingerprint disk file
represents a hash table, with a substantial proportion of the table, for example 50% or 75%,
being empty entries, or "holes."
In this embodiment, in order to determine whether a particular fingerprint N_k is in the
disk file, the hash of the fingerprint is computed. In one embodiment, only a prefix of the
fingerprint is used for the hash value. The hash value is the starting position for searching
through the fingerprint disk file. The disk file is searched sequentially, starting at the starting
position, for either a match or a hole. If a hole is found, the fingerprint N_k is stored in that
hole; if a match is found, N_k is discarded. Thus, there is only one write to the disk file for
each fingerprint not already present in the disk file, and the size of the disk file is not a factor
in the merge time. When the disk file becomes too full - for example, when only 25% of the
slots in the disk file are holes - the file must be completely rewritten into a new, larger file.
For example, the new file may be doubled in size, in which case the amortized cost of
maintaining the file is constant per fingerprint in the hash table. It will be appreciated that the
use of open addressing with a sparsely-filled disk file drastically reduces the disk re-writing
required during a merge.
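
The search-for-a-match-or-a-hole procedure and the doubling rewrite can be sketched as follows. This is an in-memory illustration under stated assumptions, not the disclosed disk-file implementation: the class `SparseFingerprintTable`, its method names, the 64-bit fingerprint width, the use of the top 32 bits as the hash prefix, and the sample fingerprints are all hypothetical.

```python
HOLE = None  # an empty slot ("hole") in the sparse table

class SparseFingerprintTable:
    """In-memory stand-in for the sparsely-filled fingerprint disk file."""

    def __init__(self, capacity=8, min_hole_fraction=0.25):
        self.slots = [HOLE] * capacity
        self.count = 0
        self.min_hole_fraction = min_hole_fraction
        self.rewrites = 0  # number of complete rewrites into a larger table

    def _start(self, fingerprint):
        # Assumption: the hash value is the top 32 bits (a prefix) of a 64-bit fingerprint.
        return (fingerprint >> 32) % len(self.slots)

    def _probe(self, fingerprint):
        # Search sequentially from the starting position for a match or a hole.
        i = self._start(fingerprint)
        while self.slots[i] is not HOLE and self.slots[i] != fingerprint:
            i = (i + 1) % len(self.slots)
        return i

    def contains(self, fingerprint):
        return self.slots[self._probe(fingerprint)] == fingerprint

    def insert(self, fingerprint):
        """Store the fingerprint in the first hole; return False if it was already present."""
        i = self._probe(fingerprint)
        if self.slots[i] == fingerprint:
            return False                     # match found: discard
        self.slots[i] = fingerprint          # the single write for a new fingerprint
        self.count += 1
        holes = len(self.slots) - self.count
        if holes <= self.min_hole_fraction * len(self.slots):
            self._rewrite(2 * len(self.slots))   # too full: rewrite into a table twice the size
        return True

    def _rewrite(self, new_capacity):
        old = [fp for fp in self.slots if fp is not HOLE]
        self.slots = [HOLE] * new_capacity
        for fp in old:
            self.slots[self._probe(fp)] = fp
        self.rewrites += 1

# Hypothetical sample data: 20 distinct 64-bit values.
fingerprints = [(n * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF for n in range(1, 21)]
table = SparseFingerprintTable(capacity=8)
results = [table.insert(fp) for fp in fingerprints]
duplicate = table.insert(fingerprints[0])
```

Because the table is rewritten as soon as holes fall to 25% of the slots, a hole always exists when a search begins, so the sequential search in `_probe` terminates.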

In one embodiment, the disk file may be divided into sparse sub-files, with open
addressing used for each sub-file. An index may be used to identify the range of fingerprint
hash values located in each sub-file, or an additional hash table may be used to map
fingerprints to the various sub-files. When a sub-file becomes too full, it may be re-written
into a new, larger file, but the entire disk file need not be re-written.
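
One way to picture the sub-file arrangement is the sketch below, in which the leading bits of each fingerprint act as the index selecting a sub-file and each sub-file grows independently. The names `SubFile` and `SubFiledTable`, the 64-bit fingerprint width, the choice of two index bits, the use of the low-order bits as the starting position within a sub-file, and the 25% hole threshold are assumptions made for illustration only.

```python
HOLE = None      # an empty slot in a sub-file
FP_BITS = 64     # assumed fingerprint width

class SubFile:
    """One sparse sub-file using open addressing (in-memory stand-in)."""

    def __init__(self, capacity):
        self.slots = [HOLE] * capacity
        self.count = 0
        self.rewrites = 0

    def probe(self, fingerprint):
        # Assumption: the low-order bits give the starting position within the sub-file.
        i = fingerprint % len(self.slots)
        while self.slots[i] is not HOLE and self.slots[i] != fingerprint:
            i = (i + 1) % len(self.slots)
        return i

    def insert(self, fingerprint):
        i = self.probe(fingerprint)
        if self.slots[i] == fingerprint:
            return False
        self.slots[i] = fingerprint
        self.count += 1
        # Rewrite only this sub-file once holes fall to 25% of its slots.
        if (len(self.slots) - self.count) * 4 <= len(self.slots):
            self.rewrite()
        return True

    def rewrite(self):
        old = [fp for fp in self.slots if fp is not HOLE]
        self.slots = [HOLE] * (2 * len(self.slots))
        for fp in old:
            self.slots[self.probe(fp)] = fp
        self.rewrites += 1

class SubFiledTable:
    """Routes each fingerprint to a sub-file by its leading `index_bits` bits."""

    def __init__(self, index_bits=2, subfile_capacity=4):
        self.shift = FP_BITS - index_bits
        self.subfiles = [SubFile(subfile_capacity) for _ in range(1 << index_bits)]

    def insert(self, fingerprint):
        return self.subfiles[fingerprint >> self.shift].insert(fingerprint)

table = SubFiledTable()
for fp in range(1, 11):              # small values: leading bits 00, so sub-file 0
    table.insert(fp)
table.insert(0xC000000000000001)     # leading bits 11, so sub-file 3
repeat = table.insert(5)             # already present in sub-file 0
```

In this run only sub-file 0 is rewritten as it fills; the other sub-files keep their original size, which mirrors the point that the entire disk file need not be re-written.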

In another aspect of the present invention, an efficient addressing scheme may be used
for either a sparse disk file, or a disk file consisting of a set of sparse sub-files. In this
addressing scheme, discussed in U.S. patent application 09/433,008, filed November 2, 1999
(hereby incorporated by reference in its entirety), each fingerprint is composed of two